---------------------------------------------- Part One ---------------------------------------------

1

Question: Please refer to the table below to answer the following questions:

  1. Refer to the table and find the joint probability of the people who planned to purchase and actually placed an order.
  2. Refer to the table and find the conditional probability that people actually placed an order, given that they planned to purchase.
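
The two quantities differ only in the denominator: the joint probability divides by the total number of people, while the conditional probability divides by the number who planned to purchase. A minimal sketch follows. The function name and the counts are hypothetical placeholders, since the table itself is not reproduced here.

```python
def joint_and_conditional(n_planned_and_ordered, n_planned, n_total):
    """Return (joint, conditional) probabilities from raw counts.

    joint       = P(planned AND ordered) = n_planned_and_ordered / n_total
    conditional = P(ordered | planned)   = n_planned_and_ordered / n_planned
    """
    joint = n_planned_and_ordered / n_total
    conditional = n_planned_and_ordered / n_planned
    return joint, conditional


# Hypothetical counts for illustration only; they are NOT the table's values.
joint, conditional = joint_and_conditional(
    n_planned_and_ordered=40, n_planned=50, n_total=200
)
print(f"P(planned and ordered) = {joint:.3f}")
print(f"P(ordered | planned)   = {conditional:.3f}")
```

Replacing the placeholder counts with the values from the table gives the answers to both parts.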

2

Question: An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.

A. Probability that none of the items are defective?

B. Probability that exactly one of the items is defective?

C. Probability that two or fewer of the items are defective?

D. Probability that three or more of the items are defective?
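
Since each item is either defective or not, with a fixed failure rate, the number of defective items in the sample can be modelled as binomial with n = 10 and p = 0.05. The sketch below uses only the standard library; the helper name `binom_pmf` is my own.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.05  # sample size and historical failure rate from the question

p_none = binom_pmf(0, n, p)                                  # part A
p_one = binom_pmf(1, n, p)                                   # part B
p_two_or_fewer = sum(binom_pmf(k, n, p) for k in range(3))   # part C
p_three_or_more = 1 - p_two_or_fewer                         # part D (complement of C)

print(f"A. P(X = 0)  = {p_none:.4f}")
print(f"B. P(X = 1)  = {p_one:.4f}")
print(f"C. P(X <= 2) = {p_two_or_fewer:.4f}")
print(f"D. P(X >= 3) = {p_three_or_more:.4f}")
```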

3

Question: A car salesman sells on average 3 cars per week.

A. Probability that in a given week he will sell some cars.

B. Probability that in a given week he will sell 2 or more but fewer than 5 cars.

C. Plot the Poisson distribution function for cumulative probability of cars sold per week vs number of cars sold per week.
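
Weekly sales with a known average rate can be modelled as Poisson with mean 3. The sketch below reads "some cars" as "at least one car", which is an assumption about the question's intent. It computes the cumulative probabilities needed for part C with the standard library only; the plotting call itself is left out.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean count is lam."""
    return exp(-lam) * lam**k / factorial(k)


lam = 3  # average cars sold per week

# Part A: "some cars" interpreted as at least one car (assumption).
p_some = 1 - poisson_pmf(0, lam)

# Part B: 2 <= X < 5, i.e. X in {2, 3, 4}.
p_2_to_4 = sum(poisson_pmf(k, lam) for k in range(2, 5))

# Part C: cumulative probabilities P(X <= k) for k = 0..10.
ks = list(range(11))
cdf = []
running = 0.0
for k in ks:
    running += poisson_pmf(k, lam)
    cdf.append(running)

print(f"A. P(X >= 1)     = {p_some:.4f}")
print(f"B. P(2 <= X < 5) = {p_2_to_4:.4f}")
for k, c in zip(ks, cdf):
    print(f"P(X <= {k:2d}) = {c:.4f}")
```

The `ks` and `cdf` lists can be passed to any plotting library (for example as a step plot) to produce the cumulative distribution chart.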

4

Accuracy in understanding orders is important for a speech-based bot at a restaurant. Company X has designed, marketed and launched the product for contactless delivery during the COVID-19 pandemic. The recognition accuracy, which measures the percentage of orders that are taken correctly, is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.

A. What is the probability that all three orders will be recognised correctly?

B. What is the probability that none of the three orders will be recognised correctly?

C. What is the probability that at least two of the three orders will be recognised correctly?
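
Because the three orders are independent and each is recognised correctly with probability 0.868, the number of correctly recognised orders is binomial with n = 3. A minimal standard-library sketch (the helper name `binom_pmf` is my own):

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 3, 0.868  # three independent orders, recognition accuracy 86.8%

p_all_three = binom_pmf(3, n, p)                             # part A
p_none = binom_pmf(0, n, p)                                  # part B
p_at_least_two = binom_pmf(2, n, p) + binom_pmf(3, n, p)     # part C

print(f"A. P(all three correct)    = {p_all_three:.4f}")
print(f"B. P(none correct)         = {p_none:.4f}")
print(f"C. P(at least two correct) = {p_at_least_two:.4f}")
```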

5

A group of 300 professionals sat for a competitive exam. The marks obtained have a mean of 60 and a standard deviation of 12, and follow a normal distribution. Answer the following questions.

A. What percentage of students score more than 80?

B. What percentage of students score less than 50?

C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
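
All three parts follow from the normal distribution with mean 60 and standard deviation 12: parts A and B are tail areas, and part C is the 90th percentile. A sketch using `statistics.NormalDist` from the standard library (Python 3.8 or later):

```python
from statistics import NormalDist

marks = NormalDist(mu=60, sigma=12)  # distribution of exam marks

pct_above_80 = (1 - marks.cdf(80)) * 100   # part A: upper tail beyond 80
pct_below_50 = marks.cdf(50) * 100         # part B: lower tail below 50
distinction_mark = marks.inv_cdf(0.90)     # part C: top 10% start at the 90th percentile

print(f"A. Percentage scoring more than 80: {pct_above_80:.2f}%")
print(f"B. Percentage scoring less than 50: {pct_below_50:.2f}%")
print(f"C. Distinction cut-off mark:        {distinction_mark:.2f}")
```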

6

Explain one real-life industry scenario [other than the ones mentioned above] where you can use the concepts learnt in this module of Applied Statistics to get a data-driven business solution.

Answer

One scenario where the Poisson distribution can be used is staffing decisions in a large retail outlet.

Staffing levels at the outlet at different times of the day would depend on the lowest number of customers/sales that can be expected with a pre-decided cut-off probability at each time (calculated from the observed average number of customers/sales at that time), and on whether this number of sales covers the cost of wages for the employees.
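
The idea can be sketched as follows: for a given average customer count, find the largest count k that is still reached with at least the cut-off probability. The function names, the hourly averages and the 95% cut-off are all hypothetical values chosen for illustration.

```python
from math import exp, factorial


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(exp(-mean) * mean**i / factorial(i) for i in range(k + 1))


def lowest_probable_count(mean, cutoff):
    """Largest k such that P(X >= k) >= cutoff for X ~ Poisson(mean)."""
    if not 0 < cutoff <= 1:
        raise ValueError("cutoff must be in (0, 1]")
    k = 0
    # P(X >= k + 1) = 1 - P(X <= k); keep raising k while it stays above the cut-off.
    while 1 - poisson_cdf(k, mean) >= cutoff:
        k += 1
    return k


# Hypothetical observed average customers per hour at different times of day.
hourly_means = {"10:00": 12, "14:00": 25, "18:00": 40}
CUTOFF = 0.95  # hypothetical pre-decided cut-off probability

for slot, mean in hourly_means.items():
    k = lowest_probable_count(mean, CUTOFF)
    print(f"{slot}: at least {k} customers with probability >= {CUTOFF}")
```

The resulting count per time slot can then be compared against the sales needed to cover wages for each additional employee.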

---------------------------------------------- Part Two ---------------------------------------------

Oldest Teams

Newest Teams

Best Performing Team in different metrics

Worst Performing Team

Plots

Tournaments Played

Scores

Games Played

Games Won

Number of Baskets Scored

Number of Top Two Finishes

Time since Team Launch in Years

Multivariate Analysis

Since the problem statement asks us to look for teams that are younger and high performing, all our metrics need to be examined with Years Since Launch in mind.

Bivariate Analysis

Average Score per Tournament vs Years Since Launch

From the above plot, Team 21, Team 39, Team 25, Team 44, Team 48, Team 49 and Team 37 look like good contenders. However, it is important to test for a minimum number of tournaments played.
After applying that check, Team 21, Team 25, Team 37 and Team 39 look like good contenders.

Games winning percentage with respect to Years since Launch

From the above plot, Team 21, Team 39, Team 25 and Team 44 look like good contenders. However, it is important to test for a minimum number of games played.
All four teams identified have played at least 100 matches and can therefore be said to have a decent win percentage among the newer teams.
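
The minimum-games check can be sketched as a filter applied before ranking by win percentage. The records and field names below are hypothetical placeholders, not values from the dataset; only the 100-game threshold comes from the text.

```python
MIN_GAMES = 100  # threshold mentioned in the analysis

# Hypothetical records for illustration only; field names are assumptions.
teams = [
    {"team": "Team A", "played": 380, "won": 160},
    {"team": "Team B", "played": 30, "won": 20},   # high win rate, too few games
    {"team": "Team C", "played": 120, "won": 40},
]


def win_percentages(records, min_games):
    """Rank teams by win percentage, ignoring teams below min_games."""
    eligible = [r for r in records if r["played"] >= min_games]
    ranked = sorted(eligible, key=lambda r: r["won"] / r["played"], reverse=True)
    return [(r["team"], round(100 * r["won"] / r["played"], 1)) for r in ranked]


for name, pct in win_percentages(teams, MIN_GAMES):
    print(f"{name}: {pct}% wins")
```

Filtering first prevents a team with a handful of games from topping the ranking on a small sample.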

Average Loss Per Game

From the above plot Team 21 and Team 39 look like the only two good contenders.

Wins Vs Losses

Team 21 is the only new team that wins more than it loses. Teams 39, 25 and 44 are the next best new teams.

Positive Basket difference

Team 21 is the only new team to have a positive basket difference. The next best teams are Teams 56, 44 and 57. Teams 25 and 37 have a negative basket difference of more than 100 in magnitude, meaning they concede considerably more than they score.

Conclusion:

With the above analysis, we can conclude that the best team to approach is Team 21, followed by Team 44 and then Teams 39 and 25. Team 44 is chosen above Teams 39 and 25 because of its higher win percentage.

Data Observations and Recommendations:

  1. Data was incomplete for Team 61. The Association should try to obtain complete data.

  2. The data was not formatted correctly. For example, dashes were used in place of 0 or NA, and the year format in the TeamLaunch column was not consistent. The recommendation is to capture data in a correct and consistent format.

  3. The column names were inconsistent. For example, the column name RunnerUp was initially Runner-up.

  4. The data seemed to be incorrect for some of the older teams. Although they had been launched much earlier, the number of games played etc. was much lower. This needs to be checked.
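
The formatting issues in observations 2 and 3 can be handled with a few small cleaning helpers. This is a sketch under stated assumptions: the helper names are my own, dashes are mapped to missing values rather than 0, and the inconsistent year format is assumed to always contain a four-digit launch year somewhere in the string.

```python
import re


def clean_count(value):
    """Convert a raw cell to int; dashes, blanks and 'NA' become None (assumption)."""
    text = str(value).strip()
    if text in {"", "-", "NA"}:
        return None
    return int(text)


def launch_year(value):
    """Extract the first four-digit year from an inconsistently formatted cell."""
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else None


def normalise_column(name):
    """Join the alphanumeric parts of a header in CamelCase, e.g. Runner-up -> RunnerUp."""
    parts = re.split(r"[^0-9A-Za-z]+", name.strip())
    return "".join(part[:1].upper() + part[1:] for part in parts if part)


# Hypothetical raw values for illustration only.
print(clean_count("-"), clean_count("12"))
print(launch_year("1950"), launch_year("1950-51"))
print(normalise_column("Runner-up"), normalise_column("TeamLaunch"))
```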

---------------------------------------------- Part Three ---------------------------------------------

Data Warehouse

Read CSV

Data Exploration

Check Data Types of each column

All six columns are of object datatype.

The data in all columns is in string format.

Check for NA/nulls in each column
  1. Column Startup, Event, Result and OperatingState have no null values
  2. Column Product has 6 null values and Column Funding has 214 null values

Data Preprocessing and Visualization

Drop Null Values
Convert the ‘Funding’ features to a numerical value.
Plot box plot for funds in millions
Lower fence from boxplot

Minimum value is $ 1 Million

Upper Fence is $ 35.5 Million

Number of outliers greater than upper fence
Drop the values greater than the Upper Fence, i.e. $ 35.5 Million
Plot boxplot after dropping the values above $ 35.5 Million
Check Frequency of OperatingState features classes
Plot a distribution plot for Funds in million
Distribution plot for Companies still Operating
Distribution plot for Companies that closed
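
The funding conversion and the fence calculation can be sketched with the standard library. Two assumptions are made here: funding strings carry a dollar sign and a K/M/B suffix (the raw format is not shown in this report), and the fences use the usual 1.5 × IQR rule with linearly interpolated quartiles. The function names and sample values are hypothetical.

```python
from statistics import quantiles

# Multipliers that convert each suffix to millions of dollars (assumed format).
UNIT_TO_MILLIONS = {"K": 1e-3, "M": 1.0, "B": 1e3}


def funding_to_millions(raw):
    """Convert a non-empty string such as '$630K' or '$19.3M' to millions."""
    text = raw.strip().lstrip("$").replace(",", "")
    unit = text[-1].upper()
    if unit in UNIT_TO_MILLIONS:
        return float(text[:-1]) * UNIT_TO_MILLIONS[unit]
    return float(text) / 1e6  # plain dollar amount without a suffix


def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical funding values in millions, for illustration only.
funds = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_fences(funds)
outliers = [v for v in funds if v > upper]
kept = [v for v in funds if v <= upper]
print(f"Lower fence: {lower}, upper fence: {upper}")
print(f"Outliers above upper fence: {outliers}")
```

Null values need to be dropped before calling `funding_to_millions`, which matches the order of the steps listed above.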

Statistical Analysis

Is there any significant difference between Funds raised by companies that are still operating vs companies that closed down?
Write the null hypothesis and alternative hypothesis.
Test for significance and conclusion

Null Hypothesis (Ho): There is no significant difference between the funding raised by companies that are still operating and those that are closed.

Alternate Hypothesis (Ha): There is a significant difference between the funding raised by companies that are still operating and those that are closed.

Since the p-value is greater than 0.05, we fail to reject the null hypothesis.

Therefore we conclude that there is no significant difference between the funding raised by companies that are still operating and those that are closed.
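
The report does not state which test produced the p-value. One possible choice, shown here purely as an illustration, is a large-sample two-sided z-test on the difference of means with unequal variances. It relies on a normal approximation, so it is only reasonable when both groups are large. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(a, b):
    """Two-sided large-sample z-test for a difference in means.

    Uses the unpooled standard error; returns (z, p_value).
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical funding values in millions; far too few for real use.
operating = [1.0, 2.5, 3.0, 4.2, 5.1, 6.3]
closed = [0.8, 2.0, 3.5, 4.0, 5.5, 6.0]

z, p_value = two_sample_z_test(operating, closed)
ALPHA = 0.05
decision = "reject" if p_value < ALPHA else "fail to reject"
print(f"z = {z:.3f}, p = {p_value:.3f}: {decision} the null hypothesis")
```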

Frequency distribution of Result Variable
Calculate percentage of winners that are still operating and percentage of contestants that are still operating
Write your hypothesis comparing the proportion of companies that are operating between winners and contestants:
Write the null hypothesis and alternative hypothesis.
Test for significance and conclusion

Null Hypothesis (Ho): There is no significant difference between the proportion of winners that are still operating and the proportion of contestants that are still operating

Alternate Hypothesis (Ha): There is a significant difference between the proportion of winners that are still operating and the proportion of contestants that are still operating

Since the p-value is greater than 0.025, we fail to reject the null hypothesis. This means that there is no significant difference between the proportion of winners that are still operating and the proportion of contestants that are still operating

Being a winner did not give any significant advantage to the winning company in terms of continuing operations from the competition time to the time of recording the data
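
A comparison of two proportions like this one is commonly done with a two-proportion z-test using a pooled standard error. The sketch below shows that calculation with the standard library; whether the original analysis used exactly this test is an assumption, and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for equality of two proportions; returns (z, p_value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts for illustration only:
# (still operating, total) for winners and for contestants.
z, p_value = two_proportion_z_test(success_a=18, n_a=25, success_b=210, n_b=330)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```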

Frequency distribution of Event Variable
Events that have Disrupt keyword from 2013 onwards
Write and perform your hypothesis along with significance test comparing the funds raised by companies across NY, SF and EU events from 2013 onwards.

Null Hypothesis (Ho): There is no significant difference between the funding raised by companies across NY, SF and EU events.

Alternate Hypothesis (Ha): There is a significant difference between the funding raised by companies across NY, SF and EU events.

Since the p-value is greater than 0.05, we fail to reject the null hypothesis.

Therefore we conclude that there is no significant difference between the funding raised by companies in events across NY, SF and EU.
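
Comparing means across three groups is typically done with a one-way ANOVA; the report does not name the test used, so this is an assumption. The sketch below computes the F statistic from the between-group and within-group sums of squares. With exactly three groups the numerator has 2 degrees of freedom, and in that special case the p-value has the closed form (1 + 2F/d2)^(-d2/2), so no statistics library is needed. The function name and sample values are hypothetical.

```python
from statistics import mean


def one_way_anova_three_groups(g1, g2, g3):
    """One-way ANOVA for exactly three groups; returns (F, p_value).

    The closed-form p-value is valid only because the numerator
    degrees of freedom equal 2 (three groups).
    """
    groups = [g1, g2, g3]
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = 2
    df_within = len(all_values) - 3
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)
    return f_stat, p_value


# Hypothetical funding values in millions for NY, SF and EU events.
ny = [1.2, 3.4, 2.2, 5.0, 4.1]
sf = [2.0, 3.9, 2.8, 6.1, 4.4]
eu = [1.0, 2.7, 3.3, 4.8, 3.6]

f_stat, p_value = one_way_anova_three_groups(ny, sf, eu)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```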

Plotting the distribution of Funding across the three different cities to compare them
Write your observations on improvements or suggestions on quality, quantity, variety, velocity, veracity etc. on the data points collected to perform a better data analysis.
  1. Data quality could be better in terms of missing values. Around 216 of the 663 rows in this dataset had to be discarded. Moreover, the data could be improved with better formatting; for example, the funding column could have been stored in float format to reduce effort and the chance of errors.
  2. Data quantity was reduced because of missing data. Assuming clean, correct and relevant data, the more data there is, the better the analysis.
  3. Variety and velocity may not be important with respect to this dataset, as the data is in just one tabular format and this is an offline exercise.
  4. In terms of veracity, the dataset may have some mistakes, but they are difficult to detect for a data scientist who may not have domain knowledge. One thing that I could notice was the presence of both NY and NYC in the location data that forms part of the 'Event' column. There is a high possibility that both NYC and NY refer to New York, but they are captured differently, so any analysis by location becomes unreliable.